23 research outputs found

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    Full text link
    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. Most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters including the Piz Daint system (6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) Application-level scalability of DNN training, 2) Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on these experiments, we present two key insights: 1) Overall, No-gRPC designs achieve better performance compared to gRPC-based approaches for most configurations, and 2) The performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduces latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-revie

    Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

    Full text link
    Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models, by design, harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens. This inherent characteristic severely impedes computational efficiency during inference as a typical inference request can require more than thousands of tokens, where generating each token requires a load of entire model weights, making the inference more memory-bound. The large overhead becomes profound in real deployment where requests arrive randomly, necessitating various generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention, falling short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel. We deconstruct the general generation pipeline into pre-processing and token generation, and equip the framework with a dedicated work scheduler for fusing the generation process temporally across all requests. By orchestrating the token-level parallelism, Flover exhibits optimal hardware efficiency and significantly spares the system resources. By further employing a fast buffer reordering algorithm that allows memory eviction of finished tasks, it brings over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the advanced tensor parallel technique, Flover proves efficacious across diverse computational landscapes, from single-GPU setups to distributed scenarios, thereby offering robust performance optimization that adapts to variable use cases.Comment: In Proceeding of 30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC

    Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters*

    No full text
    Abstract—Network congestion is an important factor affecting the performance of large scale jobs in supercomputing clusters, especially with the wide deployment of multi-core processors. The blocking nature of current day collectives makes such congestion a critical factor in their performance. On the other hand, modern interconnects like InfiniBand provide us with many novel features such as Virtual Lanes aimed at delivering better performance to end applications. Theoretical research in the field of network congestion indicate Head of Line (HoL) blocking as a common causes for congestion and the use of multiple virtual lanes as one of the ways to alleviate it. In this context, we make use of the multiple virtual lanes provided by the InfiniBand standard as a means to alleviate network congestion and thereby improve the performance of various high performance computing applications on modern multi-core clusters. We integrate our scheme into th

    Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand ∗

    No full text
    Message Oriented Middleware (MOM) is a key technology in financial market data delivery. In this context we study the Advanced Message Queuing Protocol (AMQP), an emerging open standard for MOM communication. We design a basic suite of benchmarks for AMQP’s Direct, Fanout, and Topic Exchange types. We then evaluate these benchmarks with Apache Qpid, an open source implementation of AMQP. In order to observe how AMQP performs in a real-life scenario, we also perform evaluations with a simulated stock exchange application. All our evaluations are performed over InfiniBand as well as 1 Gigabit Ethernet networks. Our results indicate that in order to achieve thehighscalabilityrequirementsdemandedbyhighperformance computational finance applications, we need to use moderncommunicationprotocols,like RDMA, whichplace less processing load on the host. We also find that the centralizedarchitectureof AMQPpresentsaconsiderablebottleneckasfarasscalabilityisconcerned. Keywords: AMQP,Computational Finance, InfiniBan
    corecore